[ES-1717770] Fix TIMEDOUT_STATE not recognized as error on interactive clusters #1244

Open
samikshya-db wants to merge 2 commits into databricks:main from samikshya-db:fix/timedout-state-interactive-cluster

samikshya-db (Collaborator) commented Mar 2, 2026

Description

Follow-up to #1199 (ES-1717770). The previous fix covered the case where FetchResults returns an error status with sqlState=57KD0. However, the interactive cluster path was still broken.

Root cause: When using interactive clusters with enableDirectResults=true, the cluster can enforce its own server-side query timeout and return TIMEDOUT_STATE directly in directResults.operationStatus — before the client's polling loop ever starts. Because isErrorOperationState did not include TIMEDOUT_STATE, the driver:

  1. Did not throw in checkOperationStatusForErrors
  2. Skipped the polling loop, because shouldContinuePolling(TIMEDOUT_STATE) returned false, so TimeoutHandler never fired
  3. Fell through to executeFetchRequest, where the server returned an error and the driver threw DatabricksHttpException instead of DatabricksTimeoutException

The same gap also affects the polling path when GetOperationStatus returns TIMEDOUT_STATE during polling.
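The pre-fix control flow described above can be traced with a small sketch. This is not the driver's code: `SketchState`, `PreFixFlowSketch`, and `trace` are hypothetical names, the state enum is a reduced stand-in for the Thrift type, and the exact conditions inside the real `shouldContinuePolling` are assumed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, reduced stand-in for the Thrift operation state enum.
enum SketchState { RUNNING_STATE, FINISHED_STATE, ERROR_STATE, TIMEDOUT_STATE }

final class PreFixFlowSketch {
  // Pre-fix behavior from the description: TIMEDOUT_STATE is not an error state.
  static boolean isErrorOperationState(SketchState state) {
    return state == SketchState.ERROR_STATE;
  }

  // Assumption: polling continues only while the query is still in flight.
  static boolean shouldContinuePolling(SketchState state) {
    return state == SketchState.RUNNING_STATE;
  }

  // Lists the steps taken for the state found in directResults.operationStatus.
  static List<String> trace(SketchState directResultsState) {
    List<String> steps = new ArrayList<>();
    if (isErrorOperationState(directResultsState)) {
      steps.add("checkOperationStatusForErrors throws");
      return steps;
    }
    steps.add("checkOperationStatusForErrors passes");
    if (shouldContinuePolling(directResultsState)) {
      steps.add("polling loop starts, TimeoutHandler can fire");
      return steps;
    }
    steps.add("polling loop skipped, TimeoutHandler never fires");
    if (directResultsState == SketchState.TIMEDOUT_STATE) {
      // The server rejects the fetch, and the error surfaces as an HTTP exception.
      steps.add("executeFetchRequest fails with DatabricksHttpException");
    } else {
      steps.add("executeFetchRequest returns rows");
    }
    return steps;
  }

  public static void main(String[] args) {
    trace(SketchState.TIMEDOUT_STATE).forEach(System.out::println);
  }
}
```

Calling `trace(SketchState.RUNNING_STATE)` instead follows the path where the polling loop starts, which is the only path on which TimeoutHandler can fire.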

Fix:

  • Add TIMEDOUT_STATE to isErrorOperationState
  • Throw DatabricksTimeoutException for TIMEDOUT_STATE in checkOperationStatusForErrors regardless of whether sqlState is set (interactive clusters do not always populate it)
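A minimal sketch of the shape of these two changes, for readers who do not have the diff open. The types here are simplified stand-ins, not the driver's classes: `OpState`, `TimeoutSketchException`, `OperationFailedSketchException`, and `StatusCheckSketch` are hypothetical, and which other states the real `isErrorOperationState` covers is an assumption.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, reduced stand-in for the Thrift operation state enum.
enum OpState { RUNNING_STATE, FINISHED_STATE, ERROR_STATE, TIMEDOUT_STATE }

// Stand-in for DatabricksTimeoutException.
class TimeoutSketchException extends RuntimeException {
  TimeoutSketchException(String message) { super(message); }
}

// Stand-in for the driver's non-timeout error type.
class OperationFailedSketchException extends RuntimeException {
  OperationFailedSketchException(String message) { super(message); }
}

final class StatusCheckSketch {
  // The fix adds TIMEDOUT_STATE; the rest of the real set is assumed here.
  private static final Set<OpState> ERROR_STATES =
      EnumSet.of(OpState.ERROR_STATE, OpState.TIMEDOUT_STATE);

  static boolean isErrorOperationState(OpState state) {
    return ERROR_STATES.contains(state);
  }

  // sqlState may be null: interactive clusters do not always populate it.
  static void checkOperationStatusForErrors(OpState state, String sqlState) {
    if (!isErrorOperationState(state)) {
      return;
    }
    if (state == OpState.TIMEDOUT_STATE) {
      // The state alone decides that this is a timeout; sqlState is not consulted.
      throw new TimeoutSketchException("Query timed out (sqlState=" + sqlState + ")");
    }
    throw new OperationFailedSketchException("Operation failed (sqlState=" + sqlState + ")");
  }
}
```

Keying the timeout branch on the operation state rather than on sqlState is what lets the same check cover both the directResults path and the polling path.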

Testing

  • testTimedOutStateInDirectResultsThrowsTimeoutException — Pavan's exact repro: server returns TIMEDOUT_STATE in directResults before polling starts
  • testTimedOutStateDuringPollingThrowsTimeoutException — server returns TIMEDOUT_STATE during polling
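The two scenarios can be pictured with a scripted sequence of server states. This is only an illustration of what the tests exercise, not the test code in this PR: `PollState`, `SketchTimeoutException`, `PollingScenarioSketch`, and `run` are hypothetical names, and the loop is a simplified stand-in for the driver's polling logic.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical, reduced stand-in for the Thrift operation state enum.
enum PollState { RUNNING_STATE, FINISHED_STATE, ERROR_STATE, TIMEDOUT_STATE }

// Stand-in for DatabricksTimeoutException.
class SketchTimeoutException extends RuntimeException {
  SketchTimeoutException(String message) { super(message); }
}

final class PollingScenarioSketch {
  // Post-fix check: TIMEDOUT_STATE always maps to a timeout exception.
  static void checkOperationStatusForErrors(PollState state) {
    if (state == PollState.TIMEDOUT_STATE) {
      throw new SketchTimeoutException("Query timed out on the server");
    }
    if (state == PollState.ERROR_STATE) {
      throw new IllegalStateException("Operation failed");
    }
  }

  // first:  state returned in directResults.operationStatus
  // polled: scripted GetOperationStatus responses, consumed in order
  static PollState run(PollState first, List<PollState> polled) {
    checkOperationStatusForErrors(first);
    PollState current = first;
    Iterator<PollState> responses = polled.iterator();
    while (current == PollState.RUNNING_STATE && responses.hasNext()) {
      current = responses.next();
      checkOperationStatusForErrors(current);
    }
    return current;
  }
}
```

Here `run(PollState.TIMEDOUT_STATE, List.of())` corresponds to the directResults scenario, and `run(PollState.RUNNING_STATE, List.of(PollState.RUNNING_STATE, PollState.TIMEDOUT_STATE))` corresponds to the polling scenario. Both throw the timeout stand-in.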

Additional Notes

The original ES-1717770 verification test passed on both a warehouse and an all-purpose cluster because setQueryTimeout(1) with a long-running query caused the server to return RUNNING_STATE first (query still in flight). The driver then entered the polling loop, where TimeoutHandler fired correctly. Pavan's repro consistently hits the other path: the cluster's own timeout fires first and returns TIMEDOUT_STATE directly, bypassing the polling loop entirely.

When using interactive clusters with enableDirectResults=true, the server
can return TIMEDOUT_STATE directly in directResults.operationStatus when
the cluster's own query timeout fires before the client's polling loop
starts. Because TIMEDOUT_STATE was not included in isErrorOperationState,
the driver silently fell through to executeFetchRequest and threw
DatabricksHttpException instead of DatabricksTimeoutException.

Fix isErrorOperationState to include TIMEDOUT_STATE, and update
checkOperationStatusForErrors to throw DatabricksTimeoutException for
TIMEDOUT_STATE regardless of whether sqlState is set, since interactive
clusters do not always populate the SQL state field.

Add tests covering:
- TIMEDOUT_STATE in directResults (server timeout fires before polling starts)
- TIMEDOUT_STATE returned during polling

Signed-off-by: Samikshya Chand <samikshya.chand@databricks.com>
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
samikshya-db changed the title from "Fix TIMEDOUT_STATE not recognized as error on interactive clusters" to "[ES-1717770] Fix TIMEDOUT_STATE not recognized as error on interactive clusters" on Mar 2, 2026